A Record Linkage-Based Data Deduplication Framework with DataCleaner Extension
Abstract
The data management process is characterised by a set of tasks where data quality management (DQM) is one of the core components. Data quality, however, is a multidimensional concept, where the nature of the data quality issues is very diverse. One of the most widely anticipated data quality challenges, which becomes particularly vital when data come from multiple sources, a typical situation in the current data-driven world, is duplicates or non-uniqueness. Even more, duplicates were recognised to be one of the key domain-specific data quality dimensions in the context of the Internet of Things (IoT) application domains, where smart grids and health dominate most. Duplicate data lead to inaccurate analyses, leading to wrong decisions, negatively affect data-driven and/or data processing activities such as the development of models, forecasts, simulations, have a negative impact on customer service, risk and crisis management, service personalisation in terms of both their accuracy and trustworthiness, decrease user adoption and satisfaction, etc. The process of determination and elimination of duplicates is known as deduplication, while the process of finding duplicates in one or more databases that refer to the same entities is known as Record Linkage. To find the duplicates, the data sets are compared with each other using similarity functions that are usually used to compare two input strings to find similarities between them, which requires quadratic time complexity. To defuse the quadratic complexity of the problem, especially in large data sources, record linkage methods, such as blocking and sorted neighbourhood, are used. In this paper, we propose a six-step record linkage deduplication framework. The operation of the framework is demonstrated on a simplified example of research data artifacts, such as publications, research projects and others of a real-world research institution representing the Research Information Systems (RIS) domain. To make the proposed framework usable, we integrated it into a tool that is already used in practice, by developing a prototype of an extension for the well-known DataCleaner. The framework detects and visualises duplicates, thereby identifying and providing the user with identified redundancies in a user-friendly manner allowing their further elimination. By removing the redundancies, the quality of the data is improved, therefore improving analyses and decision-making. This study makes a call for other researchers to take a step towards the “golden record” that can be achieved when all data quality issues are recognised and resolved, thus moving towards absolute data quality.
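The idea of reducing quadratic pairwise comparison through blocking can be illustrated with a minimal Python sketch. It is not the paper's six-step framework and does not use any DataCleaner API: the `blocking_key` function (first three characters of the normalised title), the `difflib` similarity measure, the 0.85 threshold, and the sample titles are all assumptions chosen only for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalise(title: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(title.lower().split())


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; difflib's ratio is an illustrative choice."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def blocking_key(title: str) -> str:
    """Hypothetical blocking key: first three characters of the normalised title."""
    return normalise(title)[:3]


def find_duplicates(records, threshold=0.85):
    """Compare records only within the same block.

    Returns the candidate duplicate index pairs and the number of
    comparisons that were actually performed.
    """
    blocks = defaultdict(list)
    for idx, title in enumerate(records):
        blocks[blocking_key(title)].append(idx)

    pairs, comparisons = [], 0
    for members in blocks.values():
        for i, j in combinations(members, 2):
            comparisons += 1
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs, comparisons


# Made-up sample titles, not data from the paper.
records = [
    "Data Quality in Research Information Systems",
    "data quality in research information  systems",
    "Record Linkage for Publication Databases",
    "Record Linkage for Publication Database",
    "Smart Grid Forecasting Models",
]

pairs, comparisons = find_duplicates(records)
all_pairs = len(records) * (len(records) - 1) // 2  # cost of naive all-pairs comparison
print(pairs, comparisons, all_pairs)
```

On this toy input, blocking performs 2 comparisons where the naive approach would need 10. The trade-off is that duplicates whose blocking keys differ (for example, a typo in the first characters) are never compared; the sorted neighbourhood method mentioned in the abstract, which slides a window over key-sorted records, is one way to soften that limitation.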
Similar resources
Probabilistic Deduplication, Record Linkage and Geocoding
Outline: Background and illustrative example; Record linkage; Applications, privacy and ethics; Our project and our tools; Data cleaning and standardisation; Probabilistic data standardisation and HMMs; Blocking / indexing; Record pair classification; Geocoding; Outlook. Peter Christen, May 2005
Probabilistic Linkage of Persian Record with Missing Data
Extended Abstract. When comprehensive information about a topic is scattered among two or more data sets, using only one of those data sets loses the information available in the others. Hence, it is necessary to integrate the scattered information into a single comprehensive data set. On the other hand, sometimes we are interested in recognising duplicates in a data set. The i...
A Probabilistic Deduplication, Record Linkage and Geocoding System
In many data mining projects in the health sector information from multiple data sources needs to be cleaned, deduplicated and linked in order to allow more detailed analysis. The aim of such linkages is to merge all records relating to the same entity, such as a patient. Most of the time the linkage process is challenged by the lack of a common unique entity identifier. Additionally, personal ...
An Efficient way of Record Linkage System and Deduplication using Indexing techniques, Classification and FEBRL Framework
Record linkage is an important process in data integration, which is used in merging, matching and duplicate removal from several databases that refer to the same entities. Deduplication is the process of removing duplicate records in a single database. In recent years, data cleaning and standardization have become important processes in data mining tasks. Due to the complexity of today’s databases, fin...
Data Fusion with Record Linkage
Assume that there are two sources (e.g. files) that consist of records with different information about some units, such as people. We want to fuse the information (data) that belongs to the same units. Very often in practice no identification numbers, like the Social Security Number (SSN), are available in both files, which is why there is some uncertainty about which records belong together. Anyway, we...
Journal
Journal title: Multimodal Technologies and Interaction
Year: 2022
ISSN: 2414-4088
DOI: https://doi.org/10.3390/mti6040027